This report outlines the answers to the Activity 3 exercises on Exploratory Data Analysis, all completed in RStudio.
Let’s load the prerequisite libraries; we assume the packages have already been installed.
library("tidyverse")
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
First, we make sure the diamonds data, which is part of the tidyverse library, is available. We can then calculate summary statistics for these variables and plot their distributions.
summary(select(diamonds, x, y, z))
## x y z
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.700 Median : 5.710 Median : 3.530
## Mean : 5.731 Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :10.740 Max. :58.900 Max. :31.800
ggplot(diamonds) +
geom_histogram(mapping = aes(x = x), binwidth = 0.05)
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.05)
ggplot(diamonds) +
geom_histogram(mapping = aes(x = z), binwidth = 0.05)
filter(diamonds, x == 0 | y == 0 | z == 0)
diamonds %>%
arrange(desc(x)) %>%
head()
diamonds %>%
arrange(desc(y)) %>%
head()
diamonds %>%
arrange(desc(z)) %>%
head()
ggplot(diamonds, aes(x = x, y = y)) +
geom_point()
ggplot(diamonds, aes(x = x, y = z)) +
geom_point()
ggplot(diamonds, aes(x = y, y = z)) +
geom_point()
filter(diamonds, x > 0, x < 20) %>%
ggplot() +
geom_histogram(mapping = aes(x = x), binwidth = 0.02) +
scale_x_continuous(breaks = 1:20)
filter(diamonds, y > 0, y < 20) %>%
ggplot() +
geom_histogram(mapping = aes(x = y), binwidth = 0.02) +
scale_x_continuous(breaks = 1:20)
filter(diamonds, z > 0, z < 20) %>%
ggplot() +
geom_histogram(mapping = aes(x = z), binwidth = 0.02) +
scale_x_continuous(breaks = 1:20)
summarise(diamonds, mean(x > y), mean(x > z), mean(y > z))
ggplot(filter(diamonds, price < 2500), aes(x = price)) +
geom_histogram(binwidth = 12, center = 0)
ggplot(diamonds, aes(x = price)) +
geom_histogram(binwidth = 100, center = 0)
diamonds %>%
mutate(ending = price %% 10) %>%
ggplot(aes(x = ending)) +
geom_histogram(binwidth = 1, center = 0)
diamonds %>%
mutate(ending = price %% 100) %>%
ggplot(aes(x = ending)) +
geom_histogram(binwidth = 1)
diamonds %>%
mutate(ending = price %% 1000) %>%
filter(ending >= 500, ending <= 800) %>%
ggplot(aes(x = ending)) +
geom_histogram(binwidth = 1)
diamonds %>%
filter(carat >= 0.99, carat <= 1) %>%
count(carat)
diamonds %>%
filter(carat >= 0.9, carat <= 1.1) %>%
count(carat) %>%
print(n = Inf)
## # A tibble: 21 x 2
## carat n
## <dbl> <int>
## 1 0.9 1485
## 2 0.91 570
## 3 0.92 226
## 4 0.93 142
## 5 0.94 59
## 6 0.95 65
## 7 0.96 103
## 8 0.97 59
## 9 0.98 31
## 10 0.99 23
## 11 1 1558
## 12 1.01 2242
## 13 1.02 883
## 14 1.03 523
## 15 1.04 475
## 16 1.05 361
## 17 1.06 373
## 18 1.07 342
## 19 1.08 246
## 20 1.09 287
## 21 1.1 278
ggplot(diamonds) +
geom_histogram(mapping = aes(x = price)) +
coord_cartesian(xlim = c(100, 5000), ylim = c(0, 3000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = price)) +
xlim(100, 5000) +
ylim(0, 3000)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14714 rows containing non-finite values (stat_bin).
## Warning: Removed 6 rows containing missing values (geom_bar).
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
ggplot(diamonds2, aes(x = y)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9 rows containing non-finite values (stat_bin).
diamonds %>%
mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
ggplot() +
geom_bar(mapping = aes(x = cut))
mean(c(0, 1, 2, NA), na.rm = TRUE)
## [1] 1
sum(c(0, 1, 2, NA), na.rm = TRUE)
## [1] 3
Let’s load the library
library("nycflights13")
It looks like, while non-cancelled flights occur at similar frequencies in the mornings and evenings, cancelled flights occur at a greater frequency in the evenings.
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(x=sched_dep_time, y=..density..)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = .25)+
xlim(c(5,25))
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 4 row(s) containing missing values (geom_path).
Let’s look at the same plot but smooth the distributions to make the pattern easier to see.
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(x=sched_dep_time)) +
geom_density(mapping = aes(fill = cancelled), alpha = 0.30)+
xlim(c(5,25))
## Warning: Removed 1 rows containing non-finite values (stat_density).
Let’s try a boxplot to see how it looks.
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot() +
geom_boxplot(mapping = aes(y = sched_dep_time, x = cancelled))
carat is the most important variable for predicting price.
cor(diamonds$price, select(diamonds, carat, depth, table, x, y, z))
## carat depth table x y z
## [1,] 0.9215913 -0.0106474 0.1271339 0.8844352 0.8654209 0.8612494
A fair cut seems to be associated with a higher carat. So while lower-quality diamonds may sell for more, that is driven by the carat of the diamond (the most important factor in price), and the quality simply cannot offset it.
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), orientation = "x")
diamonds %>%
mutate(color = fct_rev(color)) %>%
ggplot(aes(x = color, y = price)) +
geom_boxplot()
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = clarity, y = price))
ggplot(data = diamonds, aes(x = cut, y = carat)) +
geom_boxplot()
ggplot(data = diamonds, aes(x = cut, y = carat))+
geom_boxplot()+
coord_flip()
# Remember to install with install.packages("ggstance")
library("ggstance")
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
ggplot(data = mpg) +
geom_boxploth(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy))
ggplot(data = mpg) +
geom_boxploth(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy), orientation = "y")
## Warning: Ignoring unknown parameters: orientation
library("lvplot")
ggplot(diamonds, aes(x = cut, y = price)) + geom_lv()
Perhaps a helpful way to understand this is to see what it looks like at different specified ‘k’ values. You can see the letters when you add fill = ..LV.. to the aesthetic.
diamonds %>%
ggplot()+
lvplot::geom_lv(aes(x = cut, y = price, alpha = ..LV..), fill = "blue")+
scale_alpha_discrete(range = c(0.7, 0))
## Warning: Using alpha for a discrete variable is not advised.
diamonds %>%
ggplot()+
lvplot::geom_lv(aes(x = cut, y = price, fill = ..LV..))
The letters represent the ‘median’, ‘fourths’, ‘eighths’, and so on.
I produce plots for these three methods below. The geom_freqpoly() is better for look-up: given a price, it is easy to tell which cut has the highest density. However, the overlapping lines make it difficult to distinguish how the overall distributions relate to each other. The geom_violin() and faceted geom_histogram() have similar strengths and weaknesses. It is easy to visually distinguish differences in the overall shape of the distributions (skewness, central values, variance, etc.). However, since we can’t easily compare the vertical values of the distributions, it is difficult to look up which category has the highest density for a given price. All of these methods depend on tuning parameters that determine the level of smoothness of the distribution.
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_histogram() +
facet_wrap(~cut, ncol = 1, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_violin() +
coord_flip()
ggplot(diamonds,aes(x = cut, y = carat))+
geom_violin()
ggplot(diamonds, aes(x = carat, y = ..density..))+
geom_histogram()+
facet_wrap(~cut)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I like how geom_freqpoly overlays the lines directly, but they can overlap and be tough to tell apart, and you have to specify density for it and for geom_histogram, whereas for geom_violin density is the default. The tails in geom_violin are easy to read, but they are drawn for every value, whereas by faceting geom_histogram and setting scales = "free" you can have independent scales. I think the biggest advantage of the histogram is that it is the most familiar, so people will know what they’re looking at.
There are two methods:
geom_quasirandom() produces plots that are a mix of jitter and violin plots. There are several different methods that determine exactly how the random location of the points is generated.
geom_beeswarm() produces a plot similar to a violin plot, but by offsetting the points. I’ll use the mpg box plot example. Since these methods display individual points, they are better suited for smaller datasets.
library(ggbeeswarm)
ggplot(mpg, aes(x = displ, y = cty, color = drv))+
geom_point()
ggplot(mpg, aes(x = displ, y = cty, color = drv))+
geom_jitter()
ggplot(mpg, aes(x = displ, y = cty, color = drv))+
geom_beeswarm()
## Warning in f(...): The default behavior of beeswarm has changed in version
## 0.6.0. In versions <0.6.0, this plot would have been dodged on the y-axis. In
## versions >=0.6.0, grouponX=FALSE must be explicitly set to group on y-axis.
## Please set grouponX=TRUE/FALSE to avoid this warning and ensure proper axis
## choice.
ggplot(data = mpg) +
geom_beeswarm(mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
))
ggplot(mpg, aes(x = displ, y = cty, color = drv))+
geom_quasirandom()
## Warning in f(...): The default behavior of beeswarm has changed in version
## 0.6.0. In versions <0.6.0, this plot would have been dodged on the y-axis. In
## versions >=0.6.0, grouponX=FALSE must be explicitly set to group on y-axis.
## Please set grouponX=TRUE/FALSE to avoid this warning and ensure proper axis
## choice.
geom_jitter is similar to geom_point but adds random noise to the points; you can control the amount with the width and height arguments. This is valuable because it lets you see points that would otherwise overlap one another. geom_beeswarm adds variation in a uniform pattern, by default across only the x-axis. geom_quasirandom also defaults to distributing the points across the x-axis, but it produces quasi-random variation: ‘quasi’ because the points look as though they follow some interrelationship, and running the plot multiple times yields exactly the same plot, whereas geom_jitter gives a slightly different plot each time. To see the differences between geom_beeswarm and geom_quasirandom, it’s helpful to look at the plots above, but holding the y value constant at 1.
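Following that suggestion, here is a sketch of the plots above with y held constant at 1, so the only visible difference is the spread each geom generates:

```r
library(ggplot2)
library(ggbeeswarm)

# geom_beeswarm(): points offset in a regular, deterministic pattern
ggplot(mpg, aes(x = displ, y = 1, color = drv)) +
  geom_beeswarm()

# geom_quasirandom(): quasi-random spread, identical on every run
ggplot(mpg, aes(x = displ, y = 1, color = drv)) +
  geom_quasirandom()
```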
ggplot(data = mpg) +
geom_quasirandom(mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
))
ggplot(data = mpg) +
geom_quasirandom(
mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
),
method = "tukey"
)
ggplot(data = mpg) +
geom_quasirandom(
mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
),
method = "tukeyDense"
)
ggplot(data = mpg) +
geom_quasirandom(
mapping = aes(
x = reorder(class, hwy, FUN = median),
y = hwy
),
method = "smiley"
)
Proportion of each cut within color (change group_by(color, cut) to group_by(cut, color) to set up the converse):
cut_in_color_graph <- diamonds %>%
group_by(color, cut) %>%
summarise(n = n()) %>%
mutate(proportion_cut_in_color = n/sum(n)) %>%
ggplot(aes(x = color, y = cut))+
geom_tile(aes(fill = proportion_cut_in_color))+
labs(fill = "proportion\ncut in color")
## `summarise()` regrouping output by 'color' (override with `.groups` argument)
cut_in_color_graph
Similarly, to scale by the distribution of color within cut,
diamonds %>%
count(color, cut) %>%
group_by(cut) %>%
mutate(prop = n / sum(n)) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = prop))
I add limits = c(0, 1) to put the color scale between 0 and 1, the logical boundaries of proportions. This makes it possible to compare each cell to its actual value and would improve comparisons across multiple plots. However, it compresses the colors and makes it harder to compare within the dataset. Conversely, using the default limits of the minimum and maximum values makes it easier to compare within the dataset by emphasizing relative differences, but harder to compare across datasets.
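As a sketch of that option (assuming the tidyverse is loaded), the limits can be added to the fill scale of the previous plot:

```r
library(tidyverse)

# Pin the fill scale to the logical range of a proportion, 0 to 1
diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = prop)) +
  scale_fill_continuous(limits = c(0, 1))
```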
flights %>%
group_by(month, dest) %>%
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
geom_tile() +
labs(x = "Month", y = "Destination", fill = "Departure Delay")
## `summarise()` regrouping output by 'month' (override with `.groups` argument)
There are several things that could be done to improve it: sort destinations by a meaningful quantity (distance, number of flights, average delay), and remove missing values.
How to treat missing values is difficult. In this case, missing values correspond to airports which don’t have regular flights (at least one flight each month) from NYC. These are likely smaller airports (with higher variance in their averages due to fewer observations). When we group all pairs of (month, dest) again by dest, we should have a total count of 12 (one for each month) per group (dest). This makes it easy to filter.
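A sketch of that filter (assuming the tidyverse and nycflights13 are loaded): keep only destinations observed in all 12 months before plotting.

```r
library(tidyverse)
library(nycflights13)

flights %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  group_by(dest) %>%
  filter(n() == 12) %>%   # destinations with a flight in every month
  ungroup() %>%
  ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Departure Delay")
```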
I improved the original graph by adding a filter so that only destinations that received over 10,000 flights were included:
flights %>%
group_by(dest, month) %>%
summarise(delay_mean = mean(dep_delay, na.rm = TRUE),
n = n()) %>%
# summarise() drops the month grouping, so sum(n) is the total flights per dest
mutate(sum_n = sum(n)) %>%
filter(sum_n > 10000) %>%
ggplot(aes(x = as.factor(month), y = dest, fill = delay_mean)) +
geom_tile()
## `summarise()` regrouping output by 'dest' (override with `.groups` argument)
Another way to improve it may be to group the destinations into regions. This also avoids having to filter out data. We aren’t given region information, but we do have latitude and longitude for each airport in the airports dataset.
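As a rough sketch of that idea (the longitude cut point of -100 is an arbitrary assumption, not a real regional definition):

```r
library(tidyverse)
library(nycflights13)

# Join delays to airport coordinates and bin destinations into crude regions
flights %>%
  inner_join(airports, by = c("dest" = "faa")) %>%
  mutate(region = if_else(lon < -100, "west", "east")) %>%
  group_by(month, region) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(month), y = region, fill = dep_delay)) +
  geom_tile()
```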
If you’re comparing the proportion of cut in color and want to look at how a specific cut’s proportion is changing, it may be easier to view this while reading left to right rather than bottom to top. Compare the two plots below.
cut_in_color_graph
cut_in_color_graph+
coord_flip()
It’s usually better to use the categorical variable with a larger number of categories or the longer labels on the y axis. If at all possible, labels should be horizontal because that is easier to read.
Both cut_width() and cut_number() split a variable into groups. When using cut_width(), we need to choose the width, and the number of bins will be calculated automatically. When using cut_number(), we need to specify the number of bins, and the widths will be calculated automatically.
In either case, we want to choose the bin widths and number to be large enough to aggregate observations to remove noise, but not so large as to remove all the signal.
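The difference is easy to see by counting the groups each function produces on carat (a quick sketch):

```r
library(tidyverse)

# cut_width(): equal-width bins, unequal counts per bin
diamonds %>%
  count(cut_width(carat, 1))

# cut_number(): equal counts per bin, unequal widths
diamonds %>%
  count(cut_number(carat, 5))
```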
If categorical colors are used, no more than eight colors should be used in order to keep them distinct. Using cut_number, I will split carats into quantiles (five groups).
ggplot(
data = diamonds,
mapping = aes(color = cut_number(carat, 5), x = price)
) +
geom_freqpoly() +
labs(x = "Price", y = "Count", color = "Carat")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You should keep in mind how many lines you are going to create; they may overlap each other and look busy if you’re not careful.
smaller <- diamonds %>%
filter(carat < 3)
ggplot(smaller, aes(x = price)) +
geom_freqpoly(aes(colour = cut_number(carat, 10)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
For the visualization below I wrapped the plot in the function plotly::ggplotly(). This function renders your ggplot as HTML so that you can do things like hover over the points.
p <- ggplot(smaller, aes(x=price))+
geom_freqpoly(aes(colour = cut_width(carat, 0.25)))
plotly::ggplotly(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds, aes(x = price, y = carat))+
geom_violin(aes(group = cut_width(price, 2500)))
Here carat is plotted with box plots over 10 price bins, each containing an equal number of observations, with the bin widths determined by the data.
ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
geom_boxplot() +
coord_flip() +
xlab("Price")
diamonds %>%
mutate(percent_rank = percent_rank(carat),
small = percent_rank < 0.025,
large = percent_rank > 0.975) %>%
filter(small | large) %>%
ggplot(aes(large, price)) +
geom_violin()+
facet_wrap(~large)
Small diamonds have a left-skewed price distribution, while large diamonds have a right-skewed price distribution.
However, the distribution of very large diamonds is more variable. I am not surprised, since I knew little about diamond prices; after the fact, it does not seem surprising (as many things do). I would guess that this is due to the way in which diamonds are selected for retail sale. Suppose that someone selling a diamond only finds it profitable to sell if some combination of size, cut, clarity, and color is above a certain threshold. The smallest diamonds are only profitable to sell if they are exceptional in all the other factors (cut, clarity, and color), so the small diamonds sold have similar characteristics. However, larger diamonds may be profitable regardless of the values of the other factors. Thus we observe large diamonds with a wider variety of cut, clarity, and color, and therefore more variability in prices.
There are many options to try, so your solutions may vary from mine. Here are a few options that I tried.
ggplot(diamonds, aes(x = carat, y = price)) +
geom_hex() +
facet_wrap(~cut, ncol = 1)
ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, colour = cut)) +
geom_boxplot()
ggplot(diamonds, aes(colour = cut_number(carat, 5), y = price, x = cut)) +
geom_boxplot()
ggplot(diamonds, aes(x = carat, y = price))+
geom_jitter(aes(colour = cut), alpha = 0.2)+
geom_smooth(aes(colour = cut))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(diamonds, aes(x = carat, y = price))+
geom_boxplot(aes(group = cut_width(carat, 0.5), colour = cut))+
facet_grid(. ~ cut)
diamonds %>%
mutate(carat = cut(carat, 5)) %>%
ggplot(aes(x = carat, y = price))+
geom_boxplot(aes(group = interaction(cut_width(carat, 0.5), cut), fill = cut), position = position_dodge(preserve = "single"))
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Why is a scatterplot a better display than a binned plot for this case?
Binned plots give less precise value estimates at each point (constrained by the granularity of the binning), so outliers do not show up as clearly. They also show less precise relationships between the variables. The level of variability (at least with box plots) can also be harder to intuit. For example, look at the plot below as a binned box plot.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = cut_width(x, 1), y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))